The question we address in this paper is: how can morphological analysis be performed in the framework
of word-based morphology, that is, without resorting to the notions of morpheme, affix, morphological
exponent or any representation of these concepts? We do not present a fully fledged answer here, but
we describe a general framework for doing so and a method for computing a large part of the intended
analysis. The paper is divided into five parts. In section 1 we outline the objectives of the research and
the method. We then detail the measure of morphological similarity (section 2) and the formal analogy
we use to filter the morphological neighborhoods (section 3). We then present some preliminary results
(section 4) and a short conclusion (section 5).

1. Toward a computational word-based morphology 

1.1. Word-based vs morpheme-based morphology 

In standard morpheme-based morphology, words are made up of morphemes. The morphemes are
combined by rules of inflection, derivation and composition. Words have structures which are usually
represented as trees like the ones in figure 1. Morpheme-based morphology is both elegant and easy to
use, but it suffers from many drawbacks (Anderson, 1992; Aronoff, 1994); there is no need to enumerate
them here.
Figure 1: Word structure of the French noun derivabilite 'derivability' and the French adjective
derivationnel 'derivational' in morpheme-based morphology.



In word-based morphology (Aronoff, 1976; Bybee, 1988, 1995; Neuvel & Singh, 2001; Burzio,
2002; Blevins, 2006), the minimal units are the words. Therefore, words do not have any internal
structure. Morphological structure then becomes a level of organization of the lexicon, made up of the
morphological relations that hold between the words. Some of them play a special role, namely the
relations between the words that belong to the same lexeme, to the same inflectional series, to the same
morphological family and to the same derivational series. These four types of aggregates can be
illustrated by the lexeme and the inflectional series of the French verb form derivons 'derive' and by the
morphological family and the derivational series of the deverbal noun derivation:

• the lexeme of derivons contains the inflected forms of the verb deriver 'derive': deriver, derive,
deriverez, derivaient, derivees, derivions, etc.;

• derivons belongs to an inflectional series of first person plural, present indicative verb forms 
which also contains acclimatons 'acclimate', compilons 'compile', eduquons 'educate', localisons 
'localize', varions 'vary', etc.; 

• the morphological family of derivation contains words such as deriver, derivable, derivatif 
'derivative', derivationnel, derivabilite, derive 'drift', deriveur 'sailing dinghy', etc.; 
• derivation belongs to a derivational series of deverbal nouns in -ion such as acclimatation
'acclimation', compilation, education, localisation, variation, etc.



In the rest of the paper, we concentrate only on the derivational part of the morphological structure. 

Notice that morphology does not reduce to this lexical structure. For instance, anti universal
healthcare in (1) is a morphological construct^1 that is not likely to enter the lexicon nor have a place
in the structure.

(1) All those anti feminist, anti Democrat, anti giving everyone the right to vote, anti universal
healthcare, anti all kinds of things I thought no one was anti.
[www.talkleft.com/story/2008/4/18/204142/3621]

Actually, lexicon and morphology must be clearly separated: the main function of the lexicon is to mem- 
orize and organize the words that a speaker knows; the main function of the morphology is to produce 
and analyze words. The constructs produced by the morphology are designed to enter the lexicon and 
extend the lexical structure. In return, the lexicon provides the morphology with the information it needs 
to create and analyze morphological constructs. However, the distribution of morphological informa- 
tion between lexicon and morphology varies according to morphological theories. In morpheme-based 
morphology, each word has a separate structure, the lexicon is just a bag of morphemes (and possibly 
of fully analyzed words) and morphological rules encode the bulk of the morphological information. In 
word-based morphology, the distribution of the information is more even. The lexicon contains both the 
words and the morphological relations that hold between them. Morphology is made up of processes 
that extend the existing lexical structure with new words. These processes can also be used to create 
constructs such as (1) that have no place in the lexicon. In this paper, we are concerned only with the
lexicon structure. 

The morpheme-based vs word-based distinction shows up on the computational level. In the
morpheme-based conception, the morphological analysis of a word aims at segmenting it into a
sequence of morphemes (Dejean, 1998; Gaussier, 1999; Schone & Jurafsky, 2000; Goldsmith, 2001;
Creutz & Lagus, 2002; Bernhard, 2006). For instance, derivation is analyzed as made up of two
segments derivat and -ion, the first being identified as the root morpheme deriv and the second as the
suffix -ion. In a word-based approach, the aim of a morphological analysis is to discover the relations
between the word and the other lexical items and to identify its morphological family and its
derivational series. For instance, an analysis of the French word derivation is satisfactory if it connects
derivation with enough members of its morphological family (deriver, derivationnel, derivable, etc.)
and its derivational series (formation 'education', seduction, emission, vision, etc.).

The morphological relations are organized into analogical series. For instance, the relation between
derivation and derivable is the basis of analogies such as derivation:derivable :: variation:variable^2,
derivation:derivable :: modification:modifiable, derivation:derivable :: adaptation:adaptable,
derivation:derivable :: observation:observable, etc. Similarly, the relation between derivation and
variation gives rise to a series of analogies such as derivation:variation :: deriver:varier,
derivation:variation :: derivationnel:variationnel, derivation:variation :: derivabilite:variabilite,
derivation:variation :: derivable:variable, etc. These examples show how morphological analogies
connect the morphological families and the derivational series.
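The analogies listed above all hold on the written forms by substituting one ending for another. As a
rough illustration only, the following sketch tests that special case. It is a deliberate simplification
and not the definition of formal analogy used in the cited works; the function names are hypothetical.

```python
def _suffix_analogy(a, b, c, d):
    """True if a = p + s1, b = p + s2, c = q + s1 and d = q + s2 for some strings p, q, s1, s2."""
    # Length of the longest common prefix of a and b.
    lcp = 0
    while lcp < min(len(a), len(b)) and a[lcp] == b[lcp]:
        lcp += 1
    # Try every shared prefix of a and b, longest first.
    for k in range(lcp, -1, -1):
        s1, s2 = a[k:], b[k:]
        if c.endswith(s1) and d == c[:len(c) - len(s1)] + s2:
            return True
    return False


def is_suffix_analogy(a, b, c, d):
    """Check a : b :: c : d, also trying the equivalent form a : c :: b : d."""
    return _suffix_analogy(a, b, c, d) or _suffix_analogy(a, c, b, d)
```

With this sketch, derivation:derivable :: variation:variable is accepted through the shared prefix
deriva, and derivation:variation :: deriver:varier is accepted after exchanging the two middle terms.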

1.2. Combining morphological relatedness and formal analogy 

In the remainder of the paper, we present a computational model that makes the morphological 
derivational structure of the lexicon emerge from the semantic and the formal regularities of the words it 
contains. A first experiment is currently underway on the lexicon of French using the Trésor de la Langue
Française informatisé machine readable dictionary (or TLFi for short; atilf.atilf.fr/tlf.htm).
Our aim is to create a lexicon that provides the morphological family and the derivational series of the
words it contains. This morphological lexicon owes its strength to the global description of a significant
part of the French lexicon. We are building it from a lexicographical resource because we need semantic
descriptions for a large number of words. We are fully aware of the limitations of the lexicographical
descriptions but the benefits of using dictionaries far exceed them. Dictionaries provide definitions and
graphemic / phonological representations for a significant part of the lexicon. Besides, lexicographic
descriptions are easier to use than data extracted from corpora since they only present the sub-senses
and the definitions of the most representative usages of the words.

^1 Construct is used in this paper as a generic term to designate any linguistic object produced by the
morphology.
^2 The notation a : b :: c : d is used as a shorthand for the statement that (a, b, c, d) forms an analogical
quadruplet, or in other words that a is to b as c is to d.

Our method relies on a measure of morphological relatedness that brings the members of
morphological families and derivational series closer. This measure takes into account both the formal
and the semantic similarities between the words. The method also relies on the discovery of formal
analogies among morphological neighbors. The use of analogy is quite common in computational
morphology (Skousen, 1989; Lepage, 1998; Van den Bosch & Daelemans, 1999; Pirrelli & Yvon, 1999;
Hathout, 2005; Stroppa & Yvon, 2005). The main novelty of the method is to combine it with a measure
of morphological relatedness. First, lexical similarity is used in order to select quadruplets of words that
are related to each other. The candidates are then checked by means of analogy. The two techniques
are complementary. Morphological similarity can be computed for large numbers of words, but it is too
coarse-grained to discriminate between the words that are actually morphologically related and the ones
that are not. Formal analogy performs fine-grained filtering but is costly to calculate, so it is applied only
to the candidates selected by similarity.
More generally, our approach is original in that:

1. The computational model is purely word-based. The discovery of morphological relations
between words does not involve the notions of morpheme, affix, morphological exponent, etc. or any
representation of these concepts.

2. Membership in families and series is gradient. It accounts, for instance, for the fact that deriveur 
is morphologically and semantically closer to derive than to derivationnellement 'derivationally', 
even if the three words belong to the same family. The model connects the words that share 
semantic and / or formal features. The more features they share and the more specific these features 
are, the closer the words are. 



3. It implements the theoretical proposals of Bybee (1988, 1995) and Burzio (2002) in a
straightforward manner.

4. It is efficient enough to be used to build a large morphological resource semi-automatically. 

Besides, the model integrates semantic and formal information in a uniform manner. All kinds of
semantic information (lexicographic definitions, synonyms, synsets, etc.) and formal information
(phonological, graphemic, syllabic, etc.) can be used. These specifications can be combined easily in
spite of differences in nature and origin. The model takes advantage of the redundancy of the features
and is fairly insensitive to variation and exceptions. It is robust and language independent.
Technically, the model joins:

1. the representation of the lexicon as a graph and its exploration through random walks, along the
lines of (Gaume et al., 2002, 2005; Muller et al., 2006), and

2. formal analogies on words (Lepage, 1998, 2003; Stroppa & Yvon, 2005; Langlais & Patry, 2007).
This approach does not make use of morphemes. Correspondence between words is calculated
directly on their graphemic representations.



1.3. Network lexicon 



The morphological lexicon we intend to build is a network of words with connections mainly de- 
fined by the morphological families and the derivational series. This primary structure is completed with 
a set of analogies between pairs of morphologically related lexemes and with a morphological distance. 
The resulting lexicon is remarkably flexible and can adequately represent various morphological phe- 
nomena. One of them is allomorphy, which corresponds to locations in the network where there is a 
mismatch between the formal analogies and the organization into families and series. For instance, the 
French deverbal noun denivellation 'unevenness' can be identified as an allomorphic form because (i) it
belongs to a series of words ending in -ion, (ii) it is a member of the family of the verb deniveler 'make
uneven' and (iii) it is morphologically the closest noun to this verb. Nouns in -ion and more specifically
in -ation are normally involved in analogies with their closest verbs such as derivation:deriver ::
compilation:compiler. The absence of such analogies for denivellation appears as a gap in the analogical
grid. This gap is the sign of an allomorphy. Another cue is the near identity of denivellation with the
string denivelation which would have allowed deniveler to enter the main set of analogies involving the
nouns ending in -ion (i.e. the set of analogies with the strongest morphological density).

The lexicon also accounts for the similarity and difference between curieux 'curious' and furieux
'furious' in the same way (Jackendoff, 1975). On the one hand, furieux can be analyzed as an adjective
derived from the noun furie 'fury' but we cannot do so for curieux since it is no longer semantically
related to cure 'care'. On the other hand, both adjectives have the formal and the semantic features of
-eux derivatives. In the lexicon we propose, both adjectives belong to the same derivational series.
However, furieux and furie participate in a series of analogies with melodieux:melodie
'melodious':'melody', harmonieux:harmonie 'harmonious':'harmony', facetieux:facetie
'facetious':'joke', etc. while curieux does not. This example shows the flexibility of our model and the
higher descriptive precision we obtain from the derivational series and the morphological analogies. By
contrast, the similarity of curieux with furieux cannot be described in a morphematic model or in any
model lacking derivational series. Note that the term "derivational series" is a little misleading since
series include both derived and non-derived lexemes. Lexemes belong to the series on the basis of their
form and meaning only.

Similarly, the representation of words that include interfixes such as tartelette 'little tart', gouttelette
'droplet', or vedettariat 'stardom' (Plenat & Roche, 2003; Plenat, 2005) does not pose any difficulty.
These words are full members of their respective families and series. In these series, each of them is
the nearest neighbor of its base tarte 'tart', goutte 'drop' and vedette 'star'. The interfixes reinforce the
formal integration of these lexemes in their series.

With respect to applications, the lexicon we propose adequately fulfills the main requirements for
morphological knowledge in computational linguistics and information retrieval. Morphological
resources have several uses in these domains, such as prepositional phrase attachment disambiguation
(Bourigault, 2007) or query expansion (Xu & Croft, 1998; Jing & Tzoukermann, 1999; Moreau et al.,
2007). The morphological relations used by a syntactic parser such as Syntex (Bourigault, 2007)
associate nouns and verbs from the same family with strong morphological similarities. Our lexicon will
provide all these relations and even allow the users to select them with more precision. In information
retrieval, the retrieval performance can be improved by expanding the queries with morphologically
related words. These words are all members of the morphological families of the words of the
seed queries. Besides, the morphological distance we propose can be used to tune the expansions more
finely. Our lexicon can also be used in the design of psycholinguistic experimental material. The derived
vs non-derived nature of the words can be determined from their derivational series and their
morphological analogies. Among the other features taken into account for the conception of experimental
material, let us cite formal likeness and membership in the same family. All this information is explicitly
available in the lexicon we propose. Finally, let us stress that the relational organization of the lexicon
does not pose any difficulty, as shown by the number of applications which use WordNet (Miller et al.,
1990).



1.4. Related works 

In this research, we adopt a global approach to the lexicon which differs from other efforts such
as the MorTAL project aiming at creating a morphological database for French (Dal et al., 1999;
Hathout et al., 2002). In this project, the database is made up by analyzing a selection of French
affixes, one at a time, by means of the Derif analyzer (Namer, 2005). By contrast, our objective is to
create an entire lexicon at once.

Many works in the field of computational morphology aim to recover relations between lexical units.
All of them rely primarily on finding similarities between the graphemic forms of the words. These
relations are mainly prefixal or suffixal with two exceptions, (Yarowsky & Wicentowski, 2000) and
(Baroni et al., 2002), who use string edit distances to estimate formal similarity. As far as we know, all
the others perform some sort of segmentation even when the goal is not to find morphemes, as in
(Hathout, 2003) or (Neuvel & Fulop, 2002). The model we propose differs from these approaches in
that the graphemic similarities are determined solely on the basis of the sharing of graphemic features.
This is the main contribution of this paper.

This model is also related to approaches that combine graphemic and semantic cues in order to
identify morphemes or morphological relations between words. Usually, this semantic information is
automatically acquired from corpora by means of various techniques such as latent semantic analysis
(Schone & Jurafsky, 2000), mutual information (Baroni et al., 2002) or co-occurrence in n-word
windows (Xu & Croft, 1998; Zweigenbaum & Grabar, 2003). In the experiment presented here, semantic
information is extracted from a machine readable dictionary and semantic similarity is calculated through
random walks in a lexical graph. The approach presented here can also be compared with (Hathout,
2002, 2003), where morphological knowledge is acquired by using semantic information extracted from
dictionaries of synonyms and from WordNet.

2. Morphological relatedness 

We assume here a minimalist definition of morphological relatedness: two words are morphologi- 
cally related if they share phonological and semantic properties. In the experiment, graphemic properties 
have been used instead of phonological ones because the TLFi does not provide the pronunciation of all 
the headwords. The morphological relatedness is estimated by means of a bipartite graph like the one
presented in figure 2, with one subset of vertices representing lexemes and the other representing the
formal and the semantic features of these lexemes. Lexeme vertices are identified by the lemma and the 
grammatical category. 
Figure 2: Excerpt of the bipartite graph which represents the lexicon. Words are displayed in ovals,
semantic features in rectangles and formal features in octagons. The graph is symmetric.



2.1. Formal and semantic features 

The formal properties associated with a lexeme are the n-grams of letters that occur in its lemma.
The beginning and the end of the lemma are marked by the character $. We impose a minimum size on
the n-grams (n ≥ 3). For instance, the formal features associated with the French noun orientation are
the n-grams of figure 3, with n ranging from 13 down to 3.

Figure 3 shows that the set of features associated with a given word is quite redundant. An interesting
property of this description is that it does not confer a special status to any of the individual n-grams
which characterize the lexemes. All n-grams play the same role and therefore none has the status of
morpheme. These features are only used to bring together the words that share the same sounds.
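The extraction of the formal features can be sketched as follows. This is a minimal illustration of the
description above; the function name formal_features and the parameter min_n are hypothetical.

```python
def formal_features(lemma, min_n=3):
    """Return the set of letter n-grams (n >= min_n) of a lemma delimited by '$'."""
    marked = f"${lemma}$"
    return {
        marked[i:i + n]
        for n in range(min_n, len(marked) + 1)   # every size from min_n up to the whole string
        for i in range(len(marked) - n + 1)      # every start position for that size
    }


features = formal_features("orientation")
print(len(features))          # 66 distinct n-grams, from 3 to 13 characters
print(sorted(features)[:5])
```

For orientation, the marked string $orientation$ has 13 characters, which gives 11 + 10 + ... + 1 = 66
n-grams, all distinct.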

$orientation$
$orientation orientation$
$orientatio orientation rientation$
$orientati orientatio rientation ientation$
...
$ori orie rien ient enta ntat tati atio tion ion$
$or ori rie ien ent nta tat ati tio ion on$

Figure 3: Excerpt of the formal features associated with the noun orientation.

Alternatively, one could have used the n-grams that occur in the inflected forms of the lexemes
as formal features. Such an extended characterization is more faithful to word-based morphology and
makes the inflectional allomorphies available at the derivational level. However, we did not retain this
option because inflectional endings reduce the homogeneity of the formal representations. For instance,
with a threshold n ≥ 3, the verb malaxer 'knead' would become connected to all the words that contain
xie (anxieux, lexie, orthodoxie, etc.) because of its inflected form malaxiez (second person plural,
imperfect indicative and present subjunctive). In order to avoid giving too much importance to these
very specific features, it would be necessary to weight the contribution of each inflected form with an
estimation of its frequency, computed for instance from a large text corpus. A form like malaxiez is
likely to be very rare or even missing from most corpora. In this way, the unwanted connections would
be demoted or eliminated.

The semantic features associated with a lexeme are the n-grams of words that occur in its definitions.
The n-grams that contain punctuation marks, not counting apostrophes, are eliminated. In other words,
we only use n-grams of words that occur between two punctuation marks. The words in the definitions
are POS tagged and lemmatized. The tags are A for adjectives, N for nouns, R for adverbs, V for
verbs and X for all other categories. For instance, the semantic features induced by the definition Action
d'orienter, de s'orienter; resultat de cette action 'act of directing, of finding one's way; result of this
action' of the noun orientation are presented in figure 4. Notice that the semantic features are heavily
redundant, just as the formal features are.

N.action_X.de_V.orienter   N.action_X.de   X.de_V.orienter
N.action   X.de   V.orienter   X.de_V.s'orienter   V.s'orienter
N.resultat_X.de_X.ce_N.action   N.resultat_X.de_X.ce   X.de_X.ce_N.action
N.resultat_X.de   X.de_X.ce   X.ce_N.action   N.resultat   X.ce

Figure 4: Semantic features induced by the definition Action d'orienter, de s'orienter; resultat de cette
action of the noun orientation.



This is a very coarse semantic representation inspired by repeated segments (Lebart et al., 1998).
It offers several advantages: 

1. being heavily redundant, it can capture various levels of similarity between the definitions; 

2. it integrates information of a syntagmatic nature without a deep syntactic analysis of the defini- 
tions; 

3. it slightly reduces the strong variations in the lexicographical treatment of the headwords,
especially in the division into sub-senses and in the definitions.
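The construction of the semantic features can be sketched as follows, assuming that the definition has
already been POS tagged and lemmatized. The input convention (a list of (tag, lemma) pairs, with the
hypothetical tag PUNCT for punctuation marks) and the function name are assumptions made for the
illustration; the minimum size n = 1 is read off the features of figure 4.

```python
def semantic_features(tokens):
    """Return the word n-grams (n >= 1) found between punctuation marks.

    tokens: list of (tag, lemma) pairs; the tag 'PUNCT' marks a punctuation token
    (a hypothetical convention, not taken from the paper).
    """
    features = set()
    segment = []
    # A final sentinel flushes the last segment.
    for tag, lemma in list(tokens) + [("PUNCT", "")]:
        if tag != "PUNCT":
            segment.append(f"{tag}.{lemma}")
            continue
        for n in range(1, len(segment) + 1):
            for i in range(len(segment) - n + 1):
                features.add("_".join(segment[i:i + n]))
        segment = []
    return features


# Action d'orienter, de s'orienter ; resultat de cette action
definition = [
    ("N", "action"), ("X", "de"), ("V", "orienter"), ("PUNCT", ","),
    ("X", "de"), ("V", "s'orienter"), ("PUNCT", ";"),
    ("N", "resultat"), ("X", "de"), ("X", "ce"), ("N", "action"),
]
print(sorted(semantic_features(definition)))
```

On this definition the sketch yields 16 distinct features, since X.de and N.action occur in more than one
segment and are counted once.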

2.2. Connecting the lexemes through their features 

The semantic and formal features are used in the same graph. The bipartite graph is built up by
connecting each headword to its semantic and formal features symmetrically. For instance, the noun
orientation is connected with the formal features $or, $ori, $orie, $orien, etc. which are in
turn connected with the words orienter, orientable, orientement 'orientation', orienteur 'orientator',
etc. Likewise, orientation is connected with the semantic features N.action_X.de,
N.resultat_X.de_X.ce_N.action, etc. which are themselves connected with the nouns orientement,
harmonisation, pointage 'checking', etc. The general schema is illustrated in figure 2. It shows that the
semantic and formal properties are used in the same manner. This representation corresponds precisely
to the Network Model of Bybee (1988, 1995).
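A minimal sketch of this symmetric construction is given below, assuming that the features of each
headword have already been extracted. The two dictionaries and the function name are illustrative
choices, not the data structures used in the experiment.

```python
from collections import defaultdict


def build_bipartite(lexeme_features):
    """Build both directions of the bipartite graph from a {lexeme: features} mapping."""
    lex_to_feat = {lex: set(feats) for lex, feats in lexeme_features.items()}
    feat_to_lex = defaultdict(set)
    for lex, feats in lex_to_feat.items():
        for feat in feats:
            # Each edge is stored in both directions: the graph is symmetric.
            feat_to_lex[feat].add(lex)
    return lex_to_feat, dict(feat_to_lex)


lex_to_feat, feat_to_lex = build_bipartite({
    "N.orientation": ["$ori", "tion$", "N.action_X.de"],
    "V.orienter": ["$ori"],
    "N.orientement": ["$ori", "N.action_X.de"],
})
print(sorted(feat_to_lex["$ori"]))   # the three lexemes sharing the formal feature $ori
```

Formal and semantic features live in the same dictionaries, which mirrors the uniform treatment of the
two kinds of properties.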



Actually, the bipartite structure is not essential. All we need is to be able to compute a morphological 
distance between the words. We use a bipartite graph mainly because it allows us to spread an activation 
simultaneously into the formal and the semantic subparts of the graph. The graph is also interesting 
because it contains representations of properties that are useful for morphological studies. They could 
for instance be used to describe the semantics of the -able suffixation or to find the characteristic endings 
of boat names in French (voilier, petrolier, bananier, thonier, sardinier. . . ; patrouilleur, torpilleur, 
caboteur, deriveur, dragueur. . . ). 

2.3. Estimating the morphological similarity between words 

The morphological similarity between a word and its neighbors is estimated by simulating the 
spreading of an activation initiated at the vertex that represents that word. Since the graph is bipar- 
tite, the activation has to be propagated an even number of times. The graph being heavily redundant, 
two steps of propagation are sufficient to obtain the intended proximity estimations. 

For instance, if we want to determine what the closest neighbors of orientation are, we initiate an 
activation at the vertex that represents orientation. Then, this activation is uniformly spread toward the 
formal and semantic features of orientation. In the next step, the activation located on the feature vertices 
is spread toward the lexeme vertices. The greater the number of features shared by a lexeme with orien- 
tation and the more specific these features are, the stronger the activation it receives. The assumption is 
that the strength of the activation is an estimation of the degree of morphological relatedness. 
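The two propagation steps can be sketched as follows on a toy graph. The sketch spreads the activation
uniformly over all the features of the start lexeme, without distinguishing formal from semantic
features; the toy data and the function name are hypothetical.

```python
from collections import defaultdict

# Toy bipartite graph (illustrative data only).
LEX_TO_FEAT = {
    "orientation": {"$ori", "tion$"},
    "orienter": {"$ori"},
    "variation": {"tion$"},
    "formation": {"tion$"},
}
FEAT_TO_LEX = {
    "$ori": {"orientation", "orienter"},
    "tion$": {"orientation", "variation", "formation"},
}


def spread_two_steps(start, lex_to_feat, feat_to_lex):
    """Activation of each lexeme after a lexeme -> feature -> lexeme propagation."""
    activation = defaultdict(float)
    features = lex_to_feat[start]
    for feat in features:
        share = 1.0 / len(features)              # step 1: uniform spread to the features
        lexemes = feat_to_lex[feat]
        for lex in lexemes:
            activation[lex] += share / len(lexemes)   # step 2: uniform spread back
    return dict(activation)


act = spread_two_steps("orientation", LEX_TO_FEAT, FEAT_TO_LEX)
for lex, value in sorted(act.items(), key=lambda item: -item[1]):
    print(lex, round(value, 3))
```

In this toy graph orienter receives more activation than variation because $ori is shared by fewer
lexemes than tion$, which illustrates the role of feature specificity.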



Technically, the spreading is simulated as a random walk in the graph (Gaume et al., 2002, 2005;
Muller et al., 2006). It is classically computed as a multiplication of the stochastic adjacency matrix of
the graph. More precisely, let \(G = (V, E)\) be a graph consisting of a set of vertices
\(V = \{v_1, \ldots, v_n\}\) and a set of edges \(E \subset V \times V\). Let \(A\) be the adjacency matrix
of \(G\), that is an \(n \times n\) matrix such that \(A_{ij} = 1\) if \((v_i, v_j) \in E\) and
\(A_{ij} = 0\) if \((v_i, v_j) \notin E\). We normalize the rows of \(A\) in order to get a
stochastic matrix \(M\):

\[ \forall i \in [1, n], \forall j \in [1, n], \quad M_{ij} = \frac{A_{ij}}{\sum_{k=1}^{n} A_{ik}} \]

Then \((M^n)_{ij}\) is the probability of reaching vertex \(v_j\) from the vertex \(v_i\) through a walk of
n steps. This probability can also be regarded as an activation level of node \(v_j\) following an n-step
spreading initiated at node \(v_i\).
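The row normalization and the matrix power can be illustrated with a small pure-Python sketch. The
helper names and the three-vertex toy graph (two lexemes connected to one shared feature) are
assumptions made for the example; a real lexicon would call for sparse matrices.

```python
def row_normalize(A):
    """Divide each row of A by its sum to obtain a row-stochastic matrix."""
    M = []
    for row in A:
        total = sum(row)
        M.append([x / total if total else 0.0 for x in row])
    return M


def mat_mul(X, Y):
    """Plain matrix product of two lists of lists."""
    return [
        [sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


def walk_probabilities(A, steps):
    """Return M**steps, where M is the row-normalized version of A."""
    M = row_normalize(A)
    size = len(A)
    P = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(steps):
        P = mat_mul(P, M)
    return P


# Vertices 0 and 1 are lexemes, vertex 2 is a feature shared by both.
A = [
    [0, 0, 1],
    [0, 0, 1],
    [1, 1, 0],
]
print(walk_probabilities(A, 2)[0])   # activation reaching each vertex from vertex 0
```

After two steps, the walk that starts at lexeme 0 is on lexeme 0 or lexeme 1 with equal probability
and never on the feature vertex, which is why an even number of steps is needed in a bipartite graph.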

In the experiment presented in this paper, one half of the activation is spread toward the semantic 
features and the other half toward the formal features. The edges of the bipartite graph can be divided 
into three parts \(E = J \cup K \cup L\) where J contains the edges that connect a headword to a formal feature,
K the edges that connect a headword to a semantic feature and L the edges that connect a formal or 
semantic feature to a headword. The actual values of M are defined as follows: 
[Piecewise definition of \(M_{ij}\) not recoverable from the source; its surviving case labels are "if
\(v_i\) is connected to a semantic feature", "otherwise" and "if \(v_i\) is connected to a formal feature".]
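The half-and-half split of the activation that leaves a headword can be sketched as follows. The
treatment of headwords lacking one of the two kinds of features (the whole activation goes to the kind
that is present) is an assumption of this sketch, and the function name is hypothetical.

```python
def headword_transition(formal, semantic):
    """Return {feature: probability} for the edges leaving one headword.

    Half of the activation goes to the formal features and half to the semantic
    ones. ASSUMPTION: if one kind of feature is missing, the other kind receives
    the whole activation.
    """
    probs = {}
    if formal and semantic:
        for feat in formal:
            probs[feat] = 0.5 / len(formal)
        for feat in semantic:
            probs[feat] = 0.5 / len(semantic)
    else:
        present = formal or semantic
        for feat in present:
            probs[feat] = 1.0 / len(present)
    return probs


probs = headword_transition({"$or", "ori", "tion$", "ion$"}, {"N.action"})
print(probs["N.action"], probs["$or"])   # 0.5 for the single semantic feature, 0.125 per formal feature
```

With such a split, a headword with many formal features and few semantic ones does not let the formal
part of the graph dominate the propagation.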
2.4. Morphological neighbors



The graph used in the experiment was built from the headwords and the definitions of the TLFi.
We only removed the definitions of non-standard uses (old, slang, regionalism, etc.). The extraction and
cleaning-up of the definitions were carried out in collaboration with Bruno Gaume and Philippe Muller.
The bipartite graph was created from 225 529 definitions describing 75 024 headwords (lexemes). They
induced about 9 million features, 90% of them being associated with only one headword. These features
were removed because they do not contribute to the connections of different headwords. Table 1 shows
that this reduction is stronger for the semantic features (93%) than it is for the formal ones (69%). Indeed,
semantic descriptions show greater variability than formal ones.



features    complete     reduced    hapax
formal      1 306 497    400 915    69%
semantic    7 650 490    548 641    93%
total       8 956 987    949 556    90%

Table 1: Numbers of semantic and formal features.
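The removal of the features associated with a single headword can be sketched as follows, assuming
the graph is stored as a mapping from features to sets of headwords. The function name and the toy
data are hypothetical.

```python
def prune_hapax(feat_to_lex):
    """Drop the features attached to a single headword and report the share removed."""
    kept = {feat: lexs for feat, lexs in feat_to_lex.items() if len(lexs) > 1}
    removed = 1 - len(kept) / len(feat_to_lex) if feat_to_lex else 0.0
    return kept, removed


toy = {
    "$ori": {"orientation", "orienter"},
    "$orientation$": {"orientation"},
    "$orienter$": {"orienter"},
    "N.action_X.de_V.orienter": {"orientation"},
}
kept, removed = prune_hapax(toy)
print(sorted(kept), removed)
```

Only $ori survives in this toy example; the three other features cannot connect two different
headwords and are discarded.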

The use of the graph is illustrated in figure 5. It shows the 40 nearest neighbors of the verb fructifier
'bear fruit' for three propagation configurations. The first row (form) presents the neighbors of fructifier
in a graph that only contains formal features. It shows that the members of the morphological family
tend to appear as the closest neighbors and that the members of the derivational series (i.e. the verbs
ending in -ifier) are more distant. The members in the second row (sem) have been computed in a graph
that only contains the semantic features and the ones in the third row (form + sem) in the full graph.

form
    V.fructifier N.fructification A.fructificateur A.fructifiant A.fructifere
    V.sanctifier V.rectifier A.rectifier V.fructidoriser N.fructidorien N.fructidor
    N.fructuosite R.fructueusement A.fructueux N.rectifieur A.obstructif A.instructif
    A.destructif A.constructif N.infructuosite R.infructueusement A.infructueux
    V.transsubstantifier V.substantifier V.stratifier V.schistifier V.savantifier
    V.refortifier V.ratifier V.quantifier V.presentifier V.pontifier V.plastifier V.notifier
    V.nettifier V.mystifier V.mortifier V.justifier V.idiotifier V.identifier

sem
    V.fructifier V.trouver N.missionnaire N.mission A.missionnaire N.saisie N.police
    N.hangar N.dime N.ban V.affruiter N.melon N.saisonnement N.azedarach A.fruitier
    A.bifere V.saisonner N.roman N.troubadour V.contaminer N.conductibilite
    N.alevinage V.profiter A.fructifiant N.pouvoir V.agir N.operation V.placer
    N.rentabilite N.jouissance N.avocat N.report A.fructueux V.tourner V.chiper
    N.economat N.visa N.societe N.reserve N.recreance

form + sem
    V.fructifier A.fructifiant N.fructification A.fructificateur V.trouver A.fructifere
    V.rectifier V.sanctifier A.rectifier V.fructidoriser N.fructidor N.fructidorien
    N.missionnaire N.mission A.missionnaire A.fructueux R.fructueusement
    N.fructuosite N.rectifieur N.saisie N.police N.hangar N.dime N.ban A.fruitier
    V.affruiter A.instructif A.obstructif A.destructif A.constructif N.conductibilite
    V.saisonner N.melon N.saisonnement N.azedarach A.bifere V.contaminer N.roman
    N.troubadour N.alevinage

Figure 5: The 40 nearest neighbors of the verb fructifier when the activation is spread only toward the 
formal features in the first row, only toward the semantic ones in the second row and toward both the 
semantic and formal features in the third. Words that belong to the family or series of fructifier are in 
boldface; the others are in italic. 

The first two rows show clearly that formal features are the more predictive ones while semantic features
are the less reliable ones. These examples provide an insight into some of the features of the
morphological families and the derivational series that could be used in order to separate them: families
are small sets; series are larger sets; families have a strong semantic and formal cohesion; members of a
series have looser semantic and formal connections. The last two features explain why the members of
the morphological families tend to show up before the members of the derivational series. The examples
also show that the morphological similarity is not selective enough and that the list of neighbors cannot
be used as is. We need to filter it further and we propose to do so with formal analogy.

3. Analogy 

3.1. Familial and serial analogies 

The members of the series and families are massively involved in analogies which structure the
lexicon. For instance, fructifier and fructification, which belong to the same family, form analogies with
large numbers of pairs of members of other families (rectifier 'correct', rectification), (certifier 'assure',
certification 'attestation'), (plastifier 'coat with plastic', plastification 'lamination of document'),
(sanctifier, sanctification), (vitrifier, vitrification), etc. Besides, the first elements of each of these pairs
belong to the series of fructifier and the second ones to the series of fructification. In a dual manner,
fructifier and sanctifier, which belong to the same series, form analogies with the members of other
series (fructificateur 'fructifier', sanctificateur 'sanctifier'), (fructification, sanctification) or (fructifiant
'fructifying', sanctifiant 'sanctifying'). These pairs are respectively made of members of the families of
fructifier and sanctifier.
Figure 6: Morphological relations and neighborhood relations between the members of the 
fructifier:fructification :: rectifier:rectification analogy. 

Formal analogies can be used to filter the morphological neighbors of a word. More precisely, 
we are interested in analogies such as fructifier:fructification :: rectifier:rectification. Since fructification 
belongs to the family of fructifier and rectifier to its series, both are morphological neighbors of fructifier. 
Similarly, rectification belongs to the series of fructification and to the family of rectifier. Therefore, it is 
a morphological neighbor of both fructification and rectifier. These relations are illustrated in figure 6. 
Conversely, if we consider that the morphological neighbors of a word are likely to be morphologically 
related to that word, then we can use them to look for quadruplets that could form analogies. These 
quadruplets could be found as follows: 

For a given word a, 

    look for two of its neighbors b and c, then 
    for every d that is a neighbor of both b and c, 

        the quadruplet a : b :: c : d is likely to be an analogy. 

More generally, if b is a correct morphological neighbor of a, then it is either a member of the family of 
a or a member of its series. Therefore, there exists another neighbor c of a (c belongs to the family of 
a if b belongs to the series of a, or vice versa) such that there exists a neighbor d of b and of c such that 
a : b :: c : d. We then have only two configurations: 

1. if $b \in F_a$, then $\exists c \in S_a, \exists d \in S_b \cap F_c$, $a : b :: c : d$ 

2. if $b \in S_a$, then $\exists c \in F_a, \exists d \in F_b \cap S_c$, $a : b :: c : d$ 

where $F_x$ is the morphological family of x and $S_x$ the derivational series of x. 
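
The search for candidate quadruplets described above can be sketched as follows. This is only a minimal illustration: it assumes that the neighborhoods are available as a dictionary mapping each word to the set of its neighbors, and the name `candidate_quadruplets` and the toy neighborhoods are hypothetical, not part of the system described here. Excluding a, b and c from the candidates for d is likewise an assumption.

```python
from itertools import permutations


def candidate_quadruplets(a, neighbors):
    """Yield (a, b, c, d) where b and c are neighbors of a and
    d is a neighbor of both b and c."""
    near_a = neighbors.get(a, set())
    # Ordered pairs (b, c) of distinct neighbors of a.
    for b, c in permutations(sorted(near_a), 2):
        common = neighbors.get(b, set()) & neighbors.get(c, set())
        # Assumption: d must differ from a, b and c.
        for d in sorted(common - {a, b, c}):
            yield (a, b, c, d)


# Toy neighborhoods (hypothetical), built from the example of figure 6.
toy_neighbors = {
    "fructifier": {"fructification", "rectifier"},
    "fructification": {"fructifier", "rectification"},
    "rectifier": {"fructifier", "rectification"},
    "rectification": {"fructification", "rectifier"},
}

for quadruplet in candidate_quadruplets("fructifier", toy_neighbors):
    print(" : ".join(quadruplet))
```

Each candidate still has to be checked as a formal analogy before it is kept; the sketch only enumerates the quadruplets that are worth checking.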



3.2. Formal analogy 



A formal or graphemic analogy is a relation a : b :: c : d that holds between four strings such that 
the graphemic differences between a and b are the same as the ones between c and d. This is the case 
for fructifier:fructification :: rectifier:rectification (see figure 7). Naturally, more than one difference 
can appear in the pair, as with the four Arabic words kataba:maktoubon :: fa3ala:maf3oulon, which 
respectively are transcriptions of the verb 'write', the noun 'document', the verb 'do' and the noun 
'effect'. The differences between the first two words and between the last two can be described as 
in figure 7. They are identical for the two pairs of words. This example shows that even analogies in a 
templatic language like Arabic can be checked in this way. 
Figure 7: Formal analogies fructifier:fructification :: rectifier:rectification and kataba:maktoubon :: 
fa3ala:maf3oulon. The differences are located in the framed boxes; ε represents the empty string. 



More generally, formal analogies can be defined in terms of factorization (Stroppa & Yvon, 2005). 
Let $L$ be an alphabet and $a \in L^*$ a string over $L$. A factorization of $a$ is a sequence 
$f = (f_1, \ldots, f_n) \in (L^*)^n$ such that $a = f_1 \oplus \cdots \oplus f_n$, where $\oplus$ denotes concatenation. 
For instance, (ma, k, ε, t, ou, b, on) is a factorization of length 7 of maktoubon. Morphological analogies 
can be defined as follows. Let $(a, b, c, d) \in (L^*)^4$ be four strings. $a : b :: c : d$ is a formal analogy iff 
there exist $n \in \mathbb{N}$ and four factorizations of length $n$ of the four strings, 
$(f(a), f(b), f(c), f(d)) \in ((L^*)^n)^4$, such that 
$\forall i \in [1, n],\ (f_i(b), f_i(c)) \in \{(f_i(a), f_i(d)), (f_i(d), f_i(a))\}$. For the analogy kataba:maktoubon :: 
fa3ala:maf3oulon, the property holds for n = 7 (see figure 7). 
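
The factorization condition can be verified mechanically once four factorizations are given. The sketch below is a minimal illustration, not the implementation used here: the function name `satisfies_factorization_condition` is hypothetical, the empty string ε is written `""`, and only the factorization of maktoubon is taken from the text. The other three factorizations are assumptions chosen to align with it.

```python
def satisfies_factorization_condition(strings, factorizations):
    """Check the definition for a : b :: c : d.

    strings        -- the four strings (a, b, c, d)
    factorizations -- four tuples of factors (f(a), f(b), f(c), f(d))
    """
    fa, fb, fc, fd = factorizations
    n = len(fa)
    # The four factorizations must have the same length n.
    if not (len(fb) == len(fc) == len(fd) == n):
        return False
    # Each factorization must concatenate back to its string.
    if any("".join(f) != s for s, f in zip(strings, factorizations)):
        return False
    # For every i, (f_i(b), f_i(c)) is (f_i(a), f_i(d)) or (f_i(d), f_i(a)).
    return all(
        (fb[i], fc[i]) in {(fa[i], fd[i]), (fd[i], fa[i])}
        for i in range(n)
    )


words = ("kataba", "maktoubon", "fa3ala", "maf3oulon")
factors = (
    ("", "k", "a", "t", "a", "b", "a"),     # kataba (assumed)
    ("ma", "k", "", "t", "ou", "b", "on"),  # maktoubon (from the text)
    ("", "f", "a", "3", "a", "l", "a"),     # fa3ala (assumed)
    ("ma", "f", "", "3", "ou", "l", "on"),  # maf3oulon (assumed)
)
print(satisfies_factorization_condition(words, factors))  # True
```

The function only verifies a proposed set of factorizations; finding suitable factorizations for arbitrary strings is a separate search problem that this sketch does not address.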

3.3. Implementation 

Formal analogies are checked at the graphemic level. The differences between the first and second 
pairs of strings are calculated from the sequence of string edit operations that transform the first form of 
each pair into the second one. Both sequences must minimize the Levenshtein edit distance (i.e. have the 
least cost). Each sequence corresponds to a path in the edit lattice of the pair of words. The lattices are 
represented by a matrix computed using the standard string edit algorithm (Jurafsky & Martin, 2000). 

The path which describes the sequence of string edit operations starts at the last cell of the matrix and 
climbs to the first one. It is built as follows: for each cell, select the neighboring one with the least 
cost; in case of equal costs, prefer the cell to the left (insertion), then the one upward (deletion) and 
otherwise the one in the upper left diagonal direction (substitution). Figure 8 presents the path that is 
selected in the string edit matrix of fructueux 'fruitful' and infructueusement 'fruitlessly', and figure 9 
the sequence of edit operations for this pair. 
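
The matrix computation and the backward path selection can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: the function names are hypothetical, all operations have unit cost except matches, and a neighboring cell is only considered when it is a valid predecessor on a least-cost path (an assumption added here so that the selected path always has the minimal cost). Among valid predecessors, the preference order is the one given above: left, then upward, then diagonal.

```python
def edit_matrix(a, b):
    """Standard Levenshtein matrix: d[i][j] is the distance between
    the first i characters of a and the first j characters of b."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i][j - 1] + 1,        # insertion
                          d[i - 1][j] + 1,        # deletion
                          d[i - 1][j - 1] + sub)  # match / substitution
    return d


def edit_operations(a, b):
    """Backtrace from the last cell to the first one.
    Returns a list of (type, source, target) with "" for the empty string."""
    d = edit_matrix(a, b)
    i, j = len(a), len(b)
    ops = []
    while i > 0 or j > 0:
        if j > 0 and d[i][j] == d[i][j - 1] + 1:
            ops.append(("I", "", b[j - 1]))   # left cell: insertion
            j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("D", a[i - 1], ""))   # upper cell: deletion
            i -= 1
        else:
            kind = "M" if a[i - 1] == b[j - 1] else "S"
            ops.append((kind, a[i - 1], b[j - 1]))  # diagonal cell
            i -= 1
            j -= 1
    ops.reverse()
    return ops


ops = edit_operations("fructueux", "infructueusement")
print("".join(kind for kind, _, _ in ops))  # IIMMMMMMMMSIIIII
```

For this pair the sketch yields two insertions, eight matches, the substitution of x by s and five insertions, which agrees with the simplified sequence given in (2).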

Sequences of edit operations can be simplified by merging the series of identical character matchings. 
The sequence in figure 9 then becomes (2). This simplified sequence is identical to the one for the 
pair soucieux:insoucieusement 'worried':'unworriedly' (3), except for the matching operation. 

(2) ((I,ε,i), (I,ε,n), (M,fructueu,fructueu), (S,x,s), (I,ε,e), (I,ε,m), (I,ε,e), (I,ε,n), (I,ε,t)) 

(3) ((I,ε,i), (I,ε,n), (M,soucieu,soucieu), (S,x,s), (I,ε,e), (I,ε,m), (I,ε,e), (I,ε,n), (I,ε,t)) 

The two sequences can be made identical if the matching sub-strings are not specified (i.e. replaced by a 
wildcard character @). The resulting sequence can then be assigned to both pairs as their edit signature 
($\sigma$). The formal analogy fructueux:infructueusement :: soucieux:insoucieusement can be stated in terms 
of identity of the edit signatures of the two pairs (4). 



Footnote: The Arabic example (kataba:maktoubon :: fa3ala:maf3oulon) is adapted from examples in Lepage (1998, 2003). 
Figure 8: Least cost path describing a sequence of string edit operations that transforms fructueux 
into infructueusement. o represents the beginning of the string. Cell (i, j) in the matrix indicates the 
Levenshtein distance between the substring consisting of the first i characters of fructueux and the one 
consisting of the first j characters of infructueusement. 

Figure 9: Sequence of edit operations that transform fructueux into infructueusement. The types of the 
operations are indicated on the first row: D for deletion, I for insertion, M for matching and S for a 
substitution by a different character. 

(4) $\sigma$(fructueux, infructueusement) = 
$\sigma$(soucieux, insoucieusement) = 

((I,ε,i), (I,ε,n), (M,@,@), (S,x,s), (I,ε,e), (I,ε,m), (I,ε,e), (I,ε,n), (I,ε,t)) 

More generally, four strings $(a, b, c, d) \in (L^*)^4$ form a formal analogy $a : b :: c : d$ iff $\sigma(a, b) = \sigma(c, d)$. 
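
The passage from a sequence of edit operations to an edit signature, and the comparison of two signatures, can be sketched as follows. This is a minimal illustration: the function names `simplify`, `signature` and `same_signature` are hypothetical, operations are written as (type, source, target) triples, and the empty string ε is written `""`. The two input sequences are the simplified sequences (2) and (3).

```python
def simplify(ops):
    """Merge consecutive matching operations into a single one."""
    merged = []
    for kind, src, tgt in ops:
        if kind == "M" and merged and merged[-1][0] == "M":
            _, prev_src, prev_tgt = merged[-1]
            merged[-1] = ("M", prev_src + src, prev_tgt + tgt)
        else:
            merged.append((kind, src, tgt))
    return merged


def signature(ops):
    """Replace the matched sub-strings by the wildcard character @."""
    return tuple(
        ("M", "@", "@") if kind == "M" else (kind, src, tgt)
        for kind, src, tgt in simplify(ops)
    )


def same_signature(ops_ab, ops_cd):
    """a : b :: c : d is accepted iff both pairs share one signature."""
    return signature(ops_ab) == signature(ops_cd)


tail = [("S", "x", "s"), ("I", "", "e"), ("I", "", "m"),
        ("I", "", "e"), ("I", "", "n"), ("I", "", "t")]
seq_2 = [("I", "", "i"), ("I", "", "n"), ("M", "fructueu", "fructueu")] + tail
seq_3 = [("I", "", "i"), ("I", "", "n"), ("M", "soucieu", "soucieu")] + tail

print(same_signature(seq_2, seq_3))  # True
```

Because `simplify` merges runs of matches, the same functions also accept character-level sequences such as the one in figure 9.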

4. First results 

This is work in progress and we only have preliminary results. We have computed the 100 nearest 
neighbors of the headwords of the TLFi, then collected the formal analogies for 22 headwords belonging 
to 4 morphological families and checked them manually. An analogy a : b :: c : d is accepted as correct 
if: 

• b belongs to the family of a, c belongs to the series of a, d belongs to the series of b and to the family 
of c, or 

• b belongs to the series of a, c belongs to the family of a, d belongs to the family of b and to the series 
of c. 
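
The two acceptance conditions can be written as a simple membership test. The sketch below is only an illustration of the criterion: the check reported here was done manually, and the function name `is_correct_analogy` as well as the toy `family` and `series` dictionaries are hypothetical.

```python
def is_correct_analogy(a, b, c, d, family, series):
    """Apply the two acceptance conditions to a : b :: c : d.

    family and series map a word to the set of the other members
    of its morphological family and of its derivational series."""
    fam = lambda w: family.get(w, set())
    ser = lambda w: series.get(w, set())
    first = b in fam(a) and c in ser(a) and d in ser(b) and d in fam(c)
    second = b in ser(a) and c in fam(a) and d in fam(b) and d in ser(c)
    return first or second


# Toy data (hypothetical), limited to the four words of one analogy.
family = {
    "fructification": {"fructifier"},
    "fructifier": {"fructification"},
    "identification": {"identifier"},
    "identifier": {"identification"},
}
series = {
    "fructification": {"identification"},
    "identification": {"fructification"},
    "fructifier": {"identifier"},
    "identifier": {"fructifier"},
}

print(is_correct_analogy("fructification", "identification",
                         "fructifier", "identifier", family, series))  # True
```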

We present some examples of correct analogies in (5) and erroneous ones in (6). We can see that the 
collected analogies involve words that are derived one from the other (5a), words that are derived from a 
common base (5b) and words connected through a sequence of derivations (5c). 

(5) a. N.fructification:N.identification :: V.fructifier:V.identifier 

b. A.fructifiant:A.fructificateur :: A.glorifiant:A.glorificateur 

c. A.fructueux:A.affectueux :: N.infructuosite:N.inaffectuosite 

d. A.frugivore:A.vegetivore :: R.frugalement:R.vegetalement 

e. A.fruitarien:A.vegetarien :: N.fruitarisme:N.vegetarisme 

f. A.fruitier:A.laitier :: N.fruiterie:N.laiterie 

g. R.fructueusement:R.affectueusement :: N.fructuosite:N.affectuosite 

(6) a. A.fruite:N.fruste :: A.truite:N.truste 

b. N.fruit:N.frumentaire :: A.instruit:A.instrumentaire 

c. N.fruiterie:N.friterie :: V.effruiter:V.effriter 

We have tested three configurations (see § 2.4). In the first, we have used neighbors from the graph 
that contains the formal features only, in the second, the semantic features only, and in the third, both 
the formal and the semantic features. The results are summarized in table 2. Their quality is quite 
satisfactory. We observe that the number of analogies depends on the configuration of propagation. The 
use of the semantic features improves the precision but reduces the total number of analogies that are 
collected. The best trade-off is a simultaneous propagation toward the semantic and the formal features. 
[Table 2 body not recoverable; the only surviving value is an error rate of 1.5%.] 

Table 2: Number of the analogies collected for a sample of 22 headwords and error rate. 

The performance of the method strongly depends on the length of the headwords because the method 
mainly relies on formal similarity and because formal similarity is stronger for long words. Table 3 shows 
this correlation clearly. It presents the number of analogies and the error rate of 13 samples of 5 words, 
selected randomly. The analogies have been collected from neighborhoods in the full graph. The words 
in each group are of the same length. Lengths range from 4 to 16 letters. We can see that the analogies 
collected for words of 10 letters or more are all correct. 
Table 3: Numbers of analogies and error rates for headwords of length 4 to 16. 



5. Conclusion and directions for further research 



We have presented a computational model that makes the morphological structure of the lexicon 
emerge from the formal and semantic properties of the words it contains. The model is radically word- 
based. It integrates the semantic and formal properties of the words in a uniform manner and represents 
them in a bipartite graph. Random walks are used to simulate the spreading of activations in the lexical 
network. The level of activation obtained after the propagation indicates the lexical relatedness of the 
words. The members of the morphological family and the derivational series of a word are then identified 
among its lexical neighbors by means of formal analogies. 

Let us stress that this method is promising because it is mainly computational. Almost no theoretical 
assumptions have been made. The method primarily exploits the memory and the computing power of 
the processors. Another interesting feature is that the formal and semantic properties of the words are 
represented separately. Therefore, the method deals with so-called parasynthetic derivatives like any 
other lexemes. 

The next steps of this research are to create an initial network with only long words and then use a 
bootstrap method. One important task that remains to be done is to separate the members of the families 
from the ones of the series. We also intend to conduct a similar experiment on the English lexicon and 
to evaluate our results in a more classical manner by using the CELEX database (Baayen et al., 1995) 
as a gold standard. The evaluation should also be done with respect to well-known systems like Linguistica 
(Goldsmith, 2001) or the morphological analyzer of Bernhard (2006). 